First we import the libraries that we will need for this project. We will be using MASS to simulate data from a negative binomial distribution and rstan for the HMC implementation.
require("MASS")
require("rstan")
options(mc.cores = parallel::detectCores())
Next we load data for the project. Since we are analyzing survey responses about names and occupations, there are three sets of data. Firstly, we have the survey responses themselves, collected from the Omni study. Then, for each name asked about in the Omni study, we have its age-sex distribution, estimated from the social security administration’s records of everyone born in the last 100 years, and their names at date of birth. Lastly, for each occupation asked about in the Omni study, we have its age-sex distribution, estimated from a separate survey asking respondents of various ages and genders about their occupations.
source("src/load_data/surveys.R")
source("src/load_data/names.R")
source("src/load_data/occs.R")
With the data imported, we also load the functions that will help us simulate the data and plot the fitted models easily.
source("src/funcs/mix_kernel.R")
where we use the additional data we have on the age and gender distributions of \(\mathcal{G}_k\) to integrate over alter age and gender.
The expression above can be simplified to a closed form solution, but before we do that, it’s important to construct our novel framework of modeling the demographic distribution of an ego’s newtork.
[1] We use a kernel to model the age distribution \(\bf{p(a_j)}\) of alters with gender \(\bf{g_j}\) known by an ego of age \(\bf{a_i}\) and gender \(\bf{g_j}\):
\[\begin{align} p(a_j | g_j, a_i, g_i, i \to j) &= \frac{ 1 }{ \sqrt{ 2\pi\lambda_{g_ig_j} } } \exp\biggl\{ -\frac{ (a_i - a_j)^2 }{ 2\lambda_{g_ig_j} } \biggr\} && \\\nonumber &= \mathcal{N}(a_j | a_i, \lambda_{g_ig_j}), \end{align}\]where \(\bf{\lambda_{g_ig_j}}\) is the kernel bandwidth parameter specific to the ego and alter’s genders. This is also called a mixing kernel.
[2] We model the gender distribution of alters known by the ego as elements of a 4-by-4 matrix \(\bf{\rho}\): \[ p(g_j | a_i, g_i, i \to j) = p(g_j | g_i, i \to j) = \rho_{g_ig_j}, \] where the rows of the gender mixing are constrained \(\sum_{g_j} \rho_{g_ig_j} = 1\).
Thus, combining [1] and [2], the alter demographic distribution of an ego’s network can be modeled as: \[\begin{align} p(a_j, g_j | a_i, g_i, i \to j) &= p(g_j | a_i, g_i, i \to j) p(a_j | g_j, a_i, g_i, i \to j) && \\\nonumber &= \rho_{g_ig_j} \mathcal{N}(a_j | a_i, \lambda_{g_ig_j}). \end{align}\]where we approximate the normalized (with respect to age) distribution \(\frac{ p( j \in \mathcal{G}_k | a_j, g_j) }{ \int_{a} p( j \in \mathcal{G}_k | a, g_j) da }\) by the normal distribution \(\mathcal{N}(a_j | \mu_{kg_j}, \sigma_{kg_j}^2)\). The approximating normal’s moments, \(\mu_{kg_j}\) and \(\sigma_{kg_j}^2\), are estimated from \(\mathcal{G}_k\)’s population data, and are analogous to the mean and standard deviation of the age distribution of individuals in \(\mathcal{G}_k\) with gender \(g_j\). Additionally, since survey respondents’ ages are generally observed discretely, we approximate the integral \(\int_{a} p( j \in \mathcal{G}_k | a, g_j) da\) by the summation \(\sum_\alpha \mathbb{P}( j \in \mathcal{G}_k | \alpha, g_j)\) over discrete age \(\alpha \in \{0,1,2,..\}\).
Note that the formulation in [1] above implies that there are four bandwidth values \(\lambda_{g_ig_j}\). This implies that the variance of an ego’s alter age distribution depends on the ego’s gender but is independent of the ego’s age. To add a dependence on ego age, we replace the bandwidth with a functional bandwidth \(\bf{\lambda_{g_ig_j}(a_i)}\). To keep the overall model tractable, we further impose structure on the functional bandwidth by setting it to a fourth order B-spline. We set the knots \(t_q\) of the spline to the deciles of the population age distribution (i.e. 11 knots). Hence, the number of basis functions \(N\) is 13 (= 11 + 4 - 2), and the functional bandwidth becomes: \[ \lambda_{g_ig_j}(a_i) = a_0^{g_ig_j} + \sum_{i=1}^N a_i^{g_ig_j} B_{i,k}(x) \] where \(B_{i,k}(x)\) are the recursively defined spline basis functions.
Since this functional bandwidth does not depend on \(a_j\), it is a constant with respect to the integrals in the Simplification section. Hence, the simplication is updated to: \[\mu_{ik} = d_i \sum_{g_j} \rho_{g_ig_j} \biggl( \sum_{a_j} \frac{ N_{k, a_j, g_j} }{ N_{a_j, g_j} } \biggr) \frac{ e^{ -\frac{ (a_i - \mu_{kg_j})^2 }{ 2(\lambda_{g_ig_j}(a_i) + \sigma_{kg_j}^2) } } }{ \sqrt{ 2\pi(\lambda_{g_ig_j}(a_i) + \sigma_{kg_j}^2) } }\]
We can encapsulate the above framework into the following Stan model:
functions {
// Returns the integration of the product of two normals
real norm_prod_int(int mu1, real mu2, real var1, real var2) {
real K;
K = 1/sqrt(2 * 3.141593 * (var1 + var2)) * exp(-(mu1 - mu2)^2/(2 * (var1 + var2)));
return K;
}
// Builds the B spline with specified parameters
vector build_b_spline(real[] t, real[] ext_knots, int ind, int order);
vector build_b_spline(real[] t, real[] ext_knots, int ind, int order) {
// INPUTS:
// t: the points at which the b_spline is calculated
// ext_knots: the set of extended knots
// ind: the index of the b_spline
// order: the order of the b-spline
vector[size(t)] b_spline;
vector[size(t)] w1;
vector[size(t)] w2;
w1 = rep_vector(0, size(t));
w2 = rep_vector(0, size(t));
if (order==1) {
for (i in 1:size(t)) // B-splines of order 1 are piece-wise constant
b_spline[i] = (ext_knots[ind] <= t[i]) && (t[i] < ext_knots[ind+1]);
}
else {
if (ext_knots[ind] != ext_knots[ind+order-1])
w1 = (to_vector(t) - rep_vector(ext_knots[ind], size(t))) /
(ext_knots[ind+order-1] - ext_knots[ind]);
if (ext_knots[ind+1] != ext_knots[ind+order])
w2 = 1 - (to_vector(t) - rep_vector(ext_knots[ind+1], size(t))) /
(ext_knots[ind+order] - ext_knots[ind+1]);
// Calculating the B-spline recursively as linear interpolation of two lower-order splines
b_spline = w1 .* build_b_spline(t, ext_knots, ind, order-1) +
w2 .* build_b_spline(t, ext_knots, ind+1, order-1);
}
return b_spline;
}
}
data {
int<lower=0> K_1; // number of names
int<lower=0> K_2; // number of occupations
int<lower=1> N; // number of respondents
int<lower=1> N_K; // number of knots for the bandwidth spline
int<lower=1> N_S; // number of grid points to evaluate spline on
int<lower=1> D; // degree of bandwidth spline (order - 1)
real age_mean; // average age
int Y[N,K_1+K_2]; // responses to how many X do you know
real w[N]; // response survey weights
int age[N]; // age of respondents
int<lower=1> g_n[N]; // gender of respondents: 1=female 2=male
int<lower=1> g_k[K_1+K_2*2]; // genders of names: 1=female 2=male (names have one each, but occs have two genders each)
real<lower=0> mu_k[K_1+K_2*2]; // age mean for each gender-group k
real<lower=0> sigma_k[K_1+K_2*2]; // age standard deviation for each gender-group k
real<lower=0> sum_prob_k[K_1+K_2*2]; // sum population proportion of each gender-group k
vector[N_K] knots; // sequence of knots for the bandwidth spline
real X[N_S]; // age grid over which to evaluate spline
real mu_d; // prior mean for the log_degrees
real<lower=0> sigma_d; // prior sd for the log_degrees
real<lower=0> alpha_omega; // prior alpha for inv_omega
real<lower=0> beta_omega; // prior beta for inv_omega
vector<lower=0>[2] alpha_rho; // prior alpha for gender mixing rows
vector[3] mu_beta; // prior mean for beta
vector<lower=0>[3] sigma_beta; // prior sd for beta
real recall_power; // 0 = no recall adjustment, 0.5 = recall adjustment
real degree_regression; // 0 = simple prior, 1 = agesex regression prior
}
transformed data {
int K; // total number of groups
int N_B; // total number of B-splines
matrix[N_K + D - 1, N_S] B; // matrix of B-splines
vector[D + N_K] ext_knots_temp;
vector[2*D + N_K] ext_knots; // set of extended knots
K = K_1 + K_2;
N_B = N_K + D - 1;
ext_knots_temp = append_row(rep_vector(knots[1], D), knots);
ext_knots = append_row(ext_knots_temp, rep_vector(knots[N_K], D));
for (ind in 1:N_B)
B[ind,:] = to_row_vector(build_b_spline(X, to_array_1d(ext_knots), ind, D + 1));
B[N_B, N_S] = 1;
}
parameters {
vector[N] log_d; // log respondent degrees
real<lower=0,upper=1> inv_omega[K]; // inverse group overdispersions
simplex[2] rho[2]; // gender mixing rates
vector[3] beta; // degree regression coefs (third is logged)
real log_eta; // log sd of degree
row_vector[N_B] a_raw[2,2]; // raw spline coefficients
real a0[2,2]; // spline intercept
real<lower=0> tau[2,2]; // spline coefficient prior sd
}
transformed parameters {
real<lower=0> d[N]; // respondent degrees
real<lower=0> omega[K]; // group overdispersions
real<lower=0> eta; // sd of degree
row_vector[N_B] a[2,2]; // spline coefficients
vector[N_S] lambda[2,2]; // spline evaluated at the age grid
for(n in 1:N) {
d[n] = exp(log_d[n]);
}
for(k in 1:K) {
omega[k] = inv(inv(inv_omega[k]) - 1);
}
eta = exp(log_eta);
for(i in 1:2) {
for(j in 1:2) {
a[i,j] = a_raw[i,j] * tau[i,j];
lambda[i,j] = exp(a0[i,j] + to_vector(a[i,j] * B));
}
}
}
model {
vector[N] ex_log_d;
real mu_nk;
int age_;
// Parameter Priors
for(p in 1:3) {
beta[p] ~ normal(mu_beta[p], sigma_beta[p]);
}
log_eta ~ normal(-0.7, 0.1);
inv_omega ~ beta(alpha_omega, beta_omega);
for(i in 1:2) {
rho[i] ~ dirichlet(alpha_rho);
for(j in 1:2) {
a_raw[i,j] ~ normal(0, 2);
a0[i,j] ~ normal(0, 2);
tau[i,j] ~ normal(0, 2);
}
}
// Degree Priors
if(degree_regression == 1) {
for(n in 1:N) {
ex_log_d[n] = beta[1] +
beta[2] * (g_n[n]-1) -
exp(beta[3]) * ((age[n] - age_mean)/age_mean)^2;
}
log_d ~ normal(ex_log_d, eta);
}
else {
log_d ~ normal(mu_d, sigma_d);
}
// Likelihood
for(k in 1:K) {
for(n in 1:N) {
if(Y[n,k] >= 0) {
mu_nk = rho[g_n[n], g_k[k]] *
sum_prob_k[k] *
norm_prod_int(age[n],
mu_k[k],
lambda[g_n[n], g_k[k]][age[n]-17],
sigma_k[k]^2);
// For occupations, we add a term for other gender
if(k > K_1) {
mu_nk = mu_nk + rho[g_n[n], g_k[k+K_2]] *
sum_prob_k[k+K_2] *
norm_prod_int(age[n],
mu_k[k+K_2],
lambda[g_n[n], g_k[k+K_2]][age[n]-17],
sigma_k[k+K_2]^2);
}
if(recall_power > 0) {
mu_nk = pow(mu_nk, 1 - recall_power) * exp((-6 - exp(-7)/mu_nk) * recall_power);
}
mu_nk = mu_nk * d[n];
target += w[n] * neg_binomial_lpmf(Y[n,k] | omega[k] * mu_nk, omega[k]);
}
}
}
}
If we save the above stan model into a separate file, it can then be compiled and stored in memory as follows:
model_kernel_spline <- stan_model(file = "src/stan/degree_kernel_spline.stan")
We demonstrate how to run the fitting for the names-only model. The occupations-only and names-occupations-combined models can be run in a similar way via source("src/fit/kernel_spline_occ.R") and source("src/fit/kernel_spline_comb.R"), respectively.
First, we prepare the parameter values for the B-spline:
age_grid <- seq(min(data$age), max(data$age), 1)
knots <- quantile(age_grid, probs = seq(0, 1, .1))
N_K <- length(knots)
D <- 3
N_B <- N_K + D - 1
Next, we provide some initial values for the parameters in order to speed up convergence:
init_data <- list()
for(i in 1:4) {
init_data[[i]] <- list(log_d = rep(6, nrow(names_data)),
inv_omega = rep(5/6, ncol(names_data)),
a_raw = array(0, dim = c(2,2,N_B)),
a0 = matrix(0, nrow = 2, ncol = 2),
tau = matrix(0.5, nrow = 2, ncol = 2),
beta = c(6, 0.1, -3),
log_eta = 0)
}
Finally, we gather the data (and known prior parameters) that will be passed into Stan:
mcmc_data <- list(K_1 = ncol(names_data),
K_2 = 0,
N = nrow(names_data),
KG = ncol(names_data),
N_K = N_K,
N_S = length(age_grid),
D = D,
age_mean = age_mean_2014,
Y = as.matrix(names_data),
w = data$wgt,
age = data$age,
g_n = as.numeric(data$Gender),
g_k = c(rep(1, 6), rep(2, 6)),
mu_k = mu_k_name,
sigma_k = sigma_k_name,
sum_prob_k = sum_prob_k_name,
knots = knots,
X = age_grid,
mu_d = 6, sigma_d = 0.6,
alpha_omega = 4.5, beta_omega = 0.5,
alpha_rho = c(5,5),
mu_beta = c(6, 0.1, -3), sigma_beta = rep(1, 3),
recall_power = 0,
degree_regression = 1)
Now we are ready to fit the model in Stan (which takes a couple hours) and extract the posterior draws of the parameters:
fit_names_kernel_spline <- sampling(model_kernel_spline,
data = mcmc_data,
iter = 2000,
init = init_data)
Once the stan run has completed, we can extract the posterior means/medians of the parameters:
degree <- list()
degree$NameSexAgeKernelSpline <- apply(extract(fit_names_kernel_spline)$d,2,median)
rho_names_kernel_spline <- apply(extract(fit_names_kernel_spline)$rho, c(2,3), mean)
beta_names_kernel_spline <- colMeans(extract(fit_names_kernel_spline)$beta)
eta_names_kernel_spline <- mean(extract(fit_names_kernel_spline)$eta)
[1] First we take a look at the fitted degree distributions for 6 different egos (young men/women, middle-aged men/women, and old men/women).
par(mfrow=c(2,3))
plot_degrees(degree$NameSexAgeKernelSpline, ego_sex_age, ego_names_verbose, data$wgt, rm = rm_na_names)
[2] Next we plot the kernel bandwidths as a function of ego/alter gender and ego age.
plot_kernel_spline(age_grid, fit_names_kernel_spline, overlay = T, ylim = c(18,55))
plot_kernel_spline(age_grid, fit_names_kernel_spline, ylim = c(18,55))
[3] Finally, we plot the alter demographic kernels for the 6 different egos from [1].
par(mfrow=c(3,2))
plot_kernels(fit_ = fit_names_kernel_spline,
rho_ = rho_names_kernel_spline,
y_cent = 70,
y_index = floor(70)-17,
y_min = 0,
y_max = 100)
plot_kernels(fit_ = fit_names_kernel_spline,
rho_ = rho_names_kernel_spline,
y_cent = age_mean_2014,
y_index = floor(age_mean_2014)-17,
y_min = 0,
y_max = 100)
plot_kernels(fit_ = fit_names_kernel_spline,
rho_ = rho_names_kernel_spline,
y_cent = 21,
y_index = floor(21)-17,
y_min = 0,
y_max = 100)